A Framework for Using Tesseract to Transcribing Early Modern Texts Having Non-standard Fonts

Author

  • William Frederick
Abstract

Here we describe a framework built upon the Tesseract optical character recognition software for transcribing old texts with non-standard fonts. We illustrate the framework by creating a digital version of two volumes of a 17th-century French text. The volumes comprise 808 pages containing 84,366 words, and our system initially transcribes 88% of the words correctly. We also identify a methodology that would help correct an additional 1,007 words, raising recognition accuracy to 89%.
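The arithmetic behind the last figure can be checked with a short sketch. It assumes the reported 88% is exact, although the abstract gives only a rounded value, and the function and variable names are illustrative rather than taken from the paper.

```python
# Figures quoted in the abstract.
TOTAL_WORDS = 84_366
INITIAL_ACCURACY = 0.88   # assumption: treated as exact, though reported rounded
EXTRA_CORRECTED = 1_007


def accuracy_after_correction(total, initial_accuracy, extra_corrected):
    """Return word-level accuracy after correcting `extra_corrected` more words."""
    initially_correct = initial_accuracy * total
    return (initially_correct + extra_corrected) / total


# Share of all words recovered by the additional correction step.
gain = EXTRA_CORRECTED / TOTAL_WORDS
final = accuracy_after_correction(TOTAL_WORDS, INITIAL_ACCURACY, EXTRA_CORRECTED)

print(f"gain: {gain:.2%}, final: {final:.2%}")
# prints "gain: 1.19%, final: 89.19%"
```

Under that assumption, the extra 1,007 words add a little over one percentage point, which is consistent with the rounded 89% stated above.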


Similar resources

An Ergonomic Study of Typographic Parameters in Persian Typefaces

Introduction: The extensive development of written interactions in the current world of technology on the one hand, and the noticeable dominance of the English language in this milieu on the other, have led to inadequate utilization of Farsi in such settings, even amongst native speakers. Lack of experimental data regarding legibility and readability of the printed and electronic texts related ...


Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing

In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First we create a model for Finnish Fraktur fonts. Second we test Tesseract with the created Fraktur model and Antiqua model on single images and combinations of images with different image preprocessing methods. Against commercial ABBYY FineReader toolkit...


Memory Implementations - Current Alternatives

In an attempt to ensure good-quality printouts of our technical reports from the supplied PDF files, we process to PDF using Acrobat Distiller. We encourage our authors to use outline fonts and to embed the used subset of all fonts (in either TrueType or Type 1 format), except for the standard Acrobat typeface families of Times, Helvetica (Arial), Courier and Symbol. In the case o...


Mathematical Font Art

Currently, only a limited number of fonts are available for high-quality mathematical typesetting, such as Knuth's Computer Modern font, the STIX font, and several fonts from the TeX Gyre family. An interesting challenge is to develop tools which allow users to pick any existing favorite font and to use it for writing mathematical texts. We will present progress on this problem as part of recen...


Unsupervised Transcription of Historical Documents

We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially...




Publication date: 2015